(N 
> 



Down the Rabbit Hole: Robust Proximity Search and 
Density Estimation in Sublinear Space* 

Sariel Har-Peled^ Nirman Kumar* 



O : August 14, 2012 

(N 

< 
o 



Abstract 



For a set of n points in TR d , and parameters k and e, we present a data structure 
that answers (1 + e)-approximate k nearest neighbor queries in logarithmic time. Sur- 
^| prisingly, the space used by the data-structure is 0(n/k); that is, the space used is 

■ sublinear in the input size if k is sufficiently large. Our approach provides a novel way 

c/3 . to summarize geometric data, such that meaningful proximity queries on the data can 

be carried out using this sketch. Using this we provide a sublinear space data-structure 
that can estimate the density of a point set under various measures, including: (i) sum 
of distances of k closest points to the query point, and (ii) sum of squared distances of 
I k closest points to the query point. Our approach generalizes to other distance based 

estimation of densities of similar flavor. 

ov 

1 Introduction 

Given a set of n points P in IR d , the nearest neighbor problem is to construct a data struc- 
ture, such that for any query point q it (quickly) finds the point closest to q in P. This is an 
^ ■ important fundamental problem in computer science |SDI06l ICha08l IAI081 ICla06] . Appli- 

cations of nearest neighbor search include pattern recognition |FH49t ICH67] , self-organizing 
maps |Koh01j . information retrieval [SWY75] . vector compression |GG91j . computational 
statistics [DW82] . clustering jDHSOl] . dat a mining, learning, and many others. If one is 
interested in guaranteed performance and near linear space, there is no known way to solve 
this problem efficiently (i.e., logarithmic query time) in dimension d > 2. 

A commonly used approach for this problem is to use Voronoi diagrams. The Voronoi 
diagram of P is the decomposition of H d into interior disjoint closed cells, so that for 
each cell C there is a unique single point p £ P such that for any point q £ int(C) the 
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nearest-neighbor of q in P is p. Thus, one can compute the nearest neighbor to q by a 
point location query in the collection of Voronoi cells. In the plane, this approach leads 
to O(logn) query time, using 0(n) space, and preprocessing time O(nlogn). However, in 
higher dimensions, this solution leads to algorithms with exponential dependency on the 
dimension. The complexity of a Voronoi diagram of n points in IR^ is (n^/ 2 !) in the 
worst case. By requiring slightly more space, Clarkson [Cla88] showed a data-structure with 
query time O(logn) time, and 0(n^ d ^ +s ) space, where 5 > is a prespecified constant (the 
0{ ) notation here hides constants that are exponential in the dimension). One can tradeoff 
the space used and the query time |AM93j . Meiser [Mei93j provided a data-structure with 
query time 0(d 5 log n) (which has polynomial dependency on the dimension), where the 
space used is O (n d+<5 ) . Therefore, even for moderate dimension, the exact nearest neighbor 
data structure uses an exorbitant amount of storage. It is believed that there is no efficient 
solution for the nearest neighbor problem when the dimension is sufficiently large |MP69j : 
this difficulty has been referred to as the "curse of dimensionality" . 



Approximate Nearest Neighbor (ANN). In light of the above, major effort went into 
developing approximation algorithms for nearest neig hbor search |AMN+98l HM981 IKOROOl 
ISDI06t ICha08t IAI081 ICla06l IHIM12j . In d dimensional Euclidean space, one can answer 
ANN queries, in 0(logn + l/e d ~ l ) time using linear space |AMN+98[IIiarTT] . Because of the 
l/e d ~ l in the query time, this approach is only good for low dimensions. Interestingly, for this 
data-structure, the approximation parameter e is not prespecified during the construction; 
one can provide it during the query. An alternative approach is to use Approximate Voronoi 
Diagrams (AVD), introduced by Har-Peled [HarOlj . which are partition of space into regions, 
desirably of low total complexity, with a representative point for each region that is an 
ANN for any point in the region. In particular, Har-Peled showed that there is such a 
decomposition of size 0((n/e d )\og 2 n). This allows ANN queries to be answered in O(logn) 
time. Arya and Malamatos |AM02j showed how to build AVDs of linear complexity (i.e., 
0(n/e d )). Their construction uses Well Separated Pair Decompositions |CK95j . Further 
tradeoffs between query and space for AVDs were studied by Arya et al. |AMM09] . 



&;-nearest neighbor. A more general problem is the /c-nearest neighbors problem where 
one is interested in finding the k points in P nearest to the query point q. This is widely used 
in pattern recognition, where the majority label is used to label the query point. Here we 
are interested in the more restricted problem of computing/ approximating only the distance 
to the fcth nearest neighbor. This problem is widely used for density estimation in statistics, 
with k £s ^fn [Sil86j . It is also used in meshing (with k = d) to compute the local feature size 
of a point set in IR d |Rup95| . The problem also has applications in non-linear dimensionality 
reduction - finding low dimensional structures in data; more specifically low dimensional 
submanifolds embedded in Euclidean spaces. Algorithms like ISOMAP, LLE, Hessian-LLE, 
SDE and others, all use the k nearest neighbor as a subroutine |Ten98j IBSLTOOt IMS944 
IWS06j . 

Density estimation. Given several distributions pit, . . . , pL^ defined over IR d , and a query 
point q, we are interested in computing the a posteriori probabilities of q being generated 
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by each of these distributions. This approach is used in unsupervised learning as a way to 
classify a new point. Naturally, in most cases, the distributions are given implicitly; that 
is, one is given a large number of points sampled from each distribution. So, let \x be such 
a distribution, and P be a set of n samples. To estimate the density of \i at q, a standard 
Monte Carlo technique is to consider a ball B centered at q, and count the number of points 
of P inside B. Naturally, if P D B is too small, this estimate is not stable. Similarly, if B is 
too large, then the estimate is too smoothed out, taking into account samples that are too 
far away. One possible approach to address this issue, that is used in practice [DHSOlj . is 
to find the smallest ball centered at q that contains k points of P and use this to estimate 
the density of fi. Choosing the right value of k has to be done carefully - if it is too little, 
then the estimate is unstable, and if it is too large, it either requires the set P to be larger, 
or the estimate is too "smoothed" out to be useful (values of k that are used in practice are 
0{\fn)). See Duda et al. jDHSOl] for more details. To do such density estimation, one need 
to be able to answer, approximate or exact, k- nearest neighbor queries. 

Sometimes one is interested not only in the radius of this ball centered at the query point, 
but also in the distribution of the points inside this ball. The average distance of a point 
inside the ball to its center, can be immediately computed from the sum of distances of 
points inside the ball to the center. Similarly, the variance of this distance can be computed 
from the sum of squared distances of the points inside the ball to the center of the ball. As 
mentioned above, density estimation is used in manifold learning and surface reconstruction. 
For example, Guibas et al. jGMMll] recently used a similar density estimate to do manifold 
reconstruction. 

Answering exact &;-nearest neighbor queries. Given a point set P C ]R d , computing 
the partition of space into regions where the k nearest neighbors do not change is equivalent 
to computing the kth order Voronoi diagram. Via standard lifting, this is equivalent 
to computing the first k levels in an arrangement of hyperplanes in lR d+1 |Aur91j . More 
precisely, if we are interested in the kth nearest neighbor, we need to compute the (k — 1)- 
level in this arrangement. 

The complexity of the < k levels in a hyperplane arrang ement in ]R d+1 is 6(nL( d+1 )/ 2 J (k + 
1) r( d+1 )/ 2 l ) |CS89j . The exact complexity of the fcth level is not well understood and achieving 
tight bounds on its complexity is one of the long standing open problems in discrete geometry 
[Mat02j . In particular, via an averaging argument, in the worst case the complexity of the 
fcth order Voronoi diagram is VL{n^ d+l ^ 2 ^ [k + l)^" 1 " 1 )/ 2 !^ 1 ). As such, the complexity of /cth 
order Voronoi diagram is Q(nk) in two dimensions, and Q(n 2 k) in three dimensions. 

Thus, to provide a data-structure for answering k nearest-neighbor queries exactly and 
quickly (i.e., logarithmic query time) in IR d , requires computing the k- level of an arrange- 
ment of hyperplanes in H d+1 . The complexity of this structure is prohibitive even in two 
dimensions (this complexity determines the preprocessing and space needed by such a data- 
structure). Furthermore, naturally, the complexity of this structure increases as k increases. 
On the other end of the spectrum one can use partition-trees and parametric search to an- 
swer such queries using linear space and query time (roughly) 0{n 1 ^ 1 ^ d+1 ^ |Mat92llChalO| . 
One can get intermediate results using standard space/time tradeoffs [AE98j . 
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Known results on approximate k order Voronoi diagram. Similar to AVD, one can 

define a AVD for the k nearest neighbor. The case k = 1 is the regular approximate Voronoi 
diagram [Har01[ IAM02} IAMM09] . The case k = n is the furthest neighbor Voronoi diagram. 
It is not hard to see that it has a constant size approximation (see [Har99] . although it was 
probably known before). Our results (see below) can be interpreted as bridging between 
these two extremes. 

Quorum clustering. Carmi et al. |CDH + 05j describe how to compute efficiently a par- 
tition of the given point set into clusters of k points such that the clusters are compact. 
Specifically, this quorum clustering repeatedly computes the smallest ball containing k points, 
removes this cluster and repeats, see Section 12.2.11 for more details. Carmi et al. |CDH + 05] 
also describe a data-structure that can approximate the smallest cluster. The space of their 
data structure is 0(n/k), but it can not be directly used for our purposes. Furthermore, 
their data-structure is for two dimensions and it can not be extended to higher dimensions, 
as it uses additive Voronoi diagrams (which have high complexity in higher dimensions). 

Our results. 

We first show, in Section El that one can build a data-structure that answers /c-nearest 
neighbor queries approximately, up to a constant factor, with query time O(logn), where 
the input is a set of n points in lR d . Surprisingly, the space used by this data-structure is 
0(n/k). This result is surprising as the complexity decreases with k. This is in sharp 
contrast to behavior in the exact version of the kth order Voronoi diagram (where the 
complexity increases with k). Furthermore, for super-constant k the space used by this 
data-structure is sublinear. Furthermore, in some applications the value of k used is Q(y/n), 
and the space used is tiny fraction of the input size. This is a general reduction showing 
that such queries can be reduced to proximity search in an appropriate product space over 
n/k points computed carefully. 

In Section HI we show how to construct approximate /c-order Voronoi diagram using space 
0(e- d - l n/k) (here e > is an approximation quality parameter specified in advance). Using 
this data-structure one can answer (l+e)-approximate /c-nearest neighbor queries in O(logn) 
time. See Theorem 14.91 for the exact result. 

General density queries. We also show, in Section as an application of our data- 
structure, how to answer more robust kind of queries. For example, one can approximate (in 
roughly the same time and space as above) the sum of distances (or squared distances) from 
a query point to its k nearest neighbors. This is useful in approximating density measures 
[DHSOlj . Surprisingly, our data-structure can be used to estimate the sum of any function 
/(■) defined over the k nearest neighbor points that depends only on the distance of these 
points from the query point. Informally, we require that /(•) is monotonically increasing with 
distance, and it is (roughly) not super-polynomial. For example, for any constant p > 0, 
our data-structure requires sublinear space (i.e., 0(n/k)), and given a query point q, it can 
(1 + ^-approximate the quantity J2 u ex II u ~~ llfi where X is the set of k nearest points to 
q. The query time is logarithmic. 
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To facilitate this, in a side result that might be of independent interest, we show how 
to perform point-location queries in I compressed quadtrees of total size m simultaneously 
in 0(logm + J) time (instead of the naive 0(1 log m) query time) without asymptotically 
increasing the space needed. 

If k is specified with the query. In Section |6l given a set P of n points in IR^, we show 
how to build a data-structure, in 0(n log n) time and using 0(n) space, such that given a 
query point and parameters k and e, the data-structure can answer (1 + ^-approximate k 
nearest neighbor queries in 0(logn + l/e^ 1 ) time. Unlike previous results, this is the first 
data-structure where both k and e are specified during the query time. Previously, the data- 
structure of Arya et al. |AMM05] required knowing e in advance. Using standard techniques 
|AMN + 98] to implement it, should lead to a simple and practical algorithm for this problem. 

If k is not important. A relevant question is how to answer the approximate fc-nearest 
neighbor problem if one is allowed to also approximate k. Inherently, this is a completely 
different question that is considerably easier, and arguably less interesting. Indeed, the 
problem then boils down to using sampling carefully, and the problem loses much of its 
geometric flavor. We sketch how to solve this easier variant (this seems to be new) and 
discuss the difference with our main problem in Section 12.1.11 

Techniques used. We use quorum clustering as a starting point in our solution. In par- 
ticular, we show how such clustering can be used to get a constant factor approximation to 
the approximate k nearest neighbor distance using sublinear space. Next, we extend this 
construction and combine it with ideas used in the computation of approximate Voronoi di- 
agrams. This results in an algorithm for computing approximate /c-nearest neighbor Voronoi 
diagram. To extend this data-structure to answer the general density queries, as described 
above, requires a way to estimate the function /(•) for very few values (instead of k val- 
ues) when answering a query. We use a coreset construction to find out which values need 
to be approximated. Overall, our work requires combining several known techniques in a 
non-trivial fashion, together with some new ideas to get our new results. 

Paper organization. In Section [2] we formally define the problem and introduce some 
basic tools, including quorum clustering, which is a key insight into the problem at hand. The 
"generic" constant factor algorithm is described in Section El We describe the construction 
of the approximate /c-order Voronoi diagram in Section HJ In Section [5] we describe how to 
construct a data-structure to answer density queries of various types. In Section [6] we present 
the data-structure for answering k nearest neighbor queries that does not require knowing k 
and e in advance. We conclude in Section [7J 
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2 Preliminaries 



2.1 Problem Definition 

Given a set of n points P in IR d and a number 1 < k < n, consider a point q and order the 
points of P by their distance from q; that is, 

||q- Ui || < ||q- u 2 || < ••• < ||q- u n \\ , 

where P = {ui, u 2 , . . . , u n }. The point = nrifc(q, P) is the kth nearest neighbor of q and 
dfc(q, P) = ||q — Ufc|| is the kth nearest neighbor distance. The nearest neighbor distance 
(i.e., k — 1) is d(u, P) = min ve p ||u — v||. It is easy to verify that the function dfc(q, P) is 
1-Lipschitz as stated in the following. 

Observation 2.1. For any p, u G H d , k and a set P C ]R d , we have that dfc(u, P) < dfc(p, P) + 

IIp - 

The problem at hand is to preprocess P such that given a query point q one can compute 
Ufc quickly. The standard nearest neighbor problem is this problem for k — 1. In the 
(1 + e)- approximate kth nearest neighbor problem, given q, k and e > 0, one wants to 
find a point u G P, such that (1 — e) ||q — u&H < ||q — u|| < (1 + e) ||q — Uk\\. 

A note on the presentation of running times. Almost all of the algorithms we present 
have running times 0(exp(d)f(n)) where exp(cf) is a function exponential in d - like 2 d or e~ d 
where e is the approximation guarantee. We do not generally mention the factors like 2 d but 
we do mention factors like e~ d explicitly as is often customary in computational geometry 
approximations. This is justified since we assume d to be a constant, typically small, for the 
algorithms to be useful. Similarly e is a constant but it usually has more flexibility than d, 
in terms of varying between applications. 

2.1.1 An easier problem — if k is not important 

Consider another version of the approximate fc-nearest neighbor problem where one is allowed 
to approximate k. That is, given a query point, one has to return the (perhaps approximate) 
distance to a point which is a t- nearest neighbor to the query, where k < £ < (1 + e)k. 
As mentioned in the introduction, this is a completely different problem from the one we 
consider. Here we quickly sketch a solution to this variant (which seems to be new), and 
discuss the difference with the more interesting problem we solve in the rest of the paper. 

Indeed, one can sample the given point set, where each point is picked with probability 
0{k~ 1 e~ 2 \ogn). It is easy to verify that the 0(e~ 2 logn) nearest neighbor to the query 
(in the sample) is the required approximation with high probability. Using gradations one 
can precompute O(logn) samples that are appropriate to any value of k. Furthermore, one 
can easily reduce solving this problem to answering polylogarithmic number of queries using 
standard approximate nearest-neighbor data-structures |AH08[ IHarllj . Nevertheless, the 
resulting data-structure has both worse space and query time than the data-structure we 
present here. 
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In particular, since we solve here the harder variant, it is not clear why one should 
compromise on a weaker data-structure with worse performance. Secondly, for some of the 
applications, like density estimation, there might be a phase change in the distribution of 
distances and approximating k is not acceptable. Conversely, for such applications approx- 
imating the distances is acceptable. Finally, it is not clear how such a fuzzy data-structure 
can be used for the density estimation without some additional overheads that make it 
inherently less applicable for such problems. 

2.2 Basic tools 

For a real positive number a and a point p = (p 1; . . . , p d ) G ]R d , define G a (p) to be the grid 
point ([pi/atj a, . . . , \jPd/®\ <*)■ We call a the width or sidelength of the grid G a . Observe 
that the mapping G Q partitions IR d into cubic regions, which we call grid cells. 

Definition 2.2. A cube is a canonical cube if it is contained inside the unit cube [0, l] d , 
it is a cell in a grid G r , and r is a power of two (i.e., it might correspond to a node in a 
quadtree having [0, l] d as its root cell). We will refer to such a grid G r as a canonical grid. 
Note, that all the cells corresponding to nodes of a compressed quadtree are canonical. 

For a ball b of radius r, and a parameter ijj, let EB(b, ip) denote the set of all the canonical 
cells intersecting b, when considering the canonical grid with sidelength 2 L lo S2 V^J _ Clearly, 
|ffl(M)| = 0{(r/i)) d ). 

A ball b of radius r in IR centered at a point p can be interpreted as a point in R d+1 , 
denoted by b' = (p, r). For a regular point p G IR^, its corresponding image under this 
transformation is the mapped point p' = (p, 0) G IR d+1 . 

Given point u = (ui, . . . , Ud) G IR d we will denote its euclidean norm by ||u||. We will 
consider a point u = (ui, 112, • . • , Ud+x) G IR d+1 to be in the product metric of H d x 1R and 
endowed with the product metric norm 

IMIe = y / u? + --- + u2 + |u d+1 | . 
It is easy to see that the above defines a norm and the following holds for it. 
Lemma 2.3. For any u G IR"'"'" 1 we have ||u|| < ||u|L < \/2||u||. 
The distance of a point to a set under the ||-|| ffi norm is denoted by d e (u, P). 

Simplifying assumption. In the following, we will assume that k divides n; if not one 
can easily add fake points as necessary at infinity. We also assume that the point set P is 
contained in [1/2, 1/2 + 1 /n] d , where n = |P|. This can be achieved by scaling and translation 
(which does not effect the distance ordering). We will assume that the queries are restricted 
to the unit cube U = [0, l] d . 
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2.2.1 Quorum Clustering 

Given a set of n points in H d and a number k > 1, where A;|r;0, we start with the smallest 
ball bi that contains k points of P. Let Pi = P fl bi. We continue on the set of points 
P \ Pi by finding the smallest ball that contains k points of the remaining set of points, 
and so on. Let bi, b2, . . . , b n /k denote the set of balls found by the algorithm and let Pj = 
(P \(Pi U ■ • • U Pj-i)) H bj. Let Q and denote the center and radius, respectively, of bj, 1 < 
i < n/k. A slight symbolic perturbation can guarantee that (i) each ball bj contains exactly 
k points of P, and (ii) all the centers Ci, C2, • . • , q are distinct points. It is easy to see that 
< T2 < • • • < r n /k < diam(P). Such a clustering of P into n/k clusters is termed a quorum 
clustering and an algorithm for computing it is provided in Carmi et al. [CDH + 05] . We 
assume we have a black-box procedure QuorumCluster(P, k) [CDH + 05] that computes 
an approximate quorum clustering. It returns a list of balls, (q, r^) , 1 < i < n/k. The 
algorithm of Carmi et al. [CDH + 05] provides such sequence of clusters, where each ball is 
a 2-approximation to the smallest ball containing k points of the remaining points. The 
following is an improvement over the result of Carmi et al. [CDH+05j . 

Lemma 2.4. Given a set P of n points in IR d and parameter k, one can compute, in 
0{n log n) time, a sequence of n/k balls such that: 

(A) For every ball(ci, i"j) there is an associated subset Pi of k points ofP\ (Pj U . . . U Pj_i) 
that it covers. 

(B) The ball(ci, is a 2-approximation to the smallest ball covering k points inP \ (Pi U 
...UPi_i). 

Proof: The guarantee of Carmi et al. is slightly worse - their algorithm running time is 
0(n \og d n). They use a dynamic data-structure for answering 0{n) queries, that report how 
many points are inside a query canonical square. Since they use orthogonal range trees this 
requires 0(\og d n) time per query. Instead, one can use dynamic quadtrees. More formally, 
we store the points using linear ordering |Harll] using any balanced data-structure. A query 
to decide the number of points inside a canonical node corresponds to an interval query (i.e., 
reporting the number of elements that are inside a query interval) and can be performed in 
O(logn) time. Plugging this data-structure into the algorithm of Carmi et al. |CDH + 05] 
gives the desired result. ■ 

3 A Constant Factor Approximation 

Lemma 3.1. Let P be a set of n points in lR d , and let k > 1 be a number such that k\n. 
Let (ci, ri) , . . . ,(c n /fc, r n /k) be the list of balls returned by QuorumCluster(P, k). Let x = 
mini<j< n / fc (||q — q|| + r { ). We have that x/5 < d fe (q, P) < x. 

Proof: For any 1 < i < n/k we have bj = ball(cj, i"j) C ball(q, ||q — q|| + i"j). Since |b* D P| > 
k, we have d fc (q, P) < ||q - q|| + r*. As such, d fc (q, P) < x = mini<j< n/fc (||q - q|| + i-j). 

1 The notation k\n indicates that k is a positive integer that divides n. 



For the other direction, let % be the minimal integer such that ball(q, dfc(q, P)) contains a 
point of Pj. Then, we have 

h/2 < d fc (q, P) , 

as Xi is a 2- approximation to the radius of the smallest ball that contains k points of P, U 
Pj+i U • • • U P n /k- We also have 

||q- q|| - \i < d fe (q, P) , 

as the distance from q to any u G ball(cj, r^) satisfies ||q — u|| > ||q — q|| — by the triangle 
inequality. Putting the above together, we get 

x= min (||q- Cj|| + r 3 -) < ||q-Ci|| + rj = (||q- Ci|| - r,) +2r, < 5d fc (q, P) . ■ 

l<j<n/k 

Theorem 3.2. Given a set P of n points in IR^, and a number k > 1 such that k\n, one 
can build a data-structure, in 0(n log n) time. The resulting data- structure uses 0(n/k) 
space, and given any query point q G H d one can find a 0(1) -approximation to dfc(q, P) in 
0(}og(n/k)) time. 

Proof: We invoke QuorumCluster(P, k) to compute the clusters (q, r^), for i = 1, . . . , n/k. 

For 2 = 1,... ,n/k, let b- = (q, n) G We preprocess the set B' = jb^, . . . , b' n ^ k \ for 

2-ANN queries (in ]R d+1 ). The preprocessing time for the ANN is 0((n/k)\og(n/k)), the 
space used is 0{n/k) and the query time is 0(\og(n/k)) [Harllj . 

Given a query point q G R the algorithm computes a 2-ANN to q' = (q, 0), denoted by 
b' ; . The algorithm returns llq' — b' ll as the approximated distance. 

J II J 1 1 

Observe that, for any i, we have ||q' - bj|| < ||q' - bj|| ffi < V2 ||q' - b£|| by Lemma O 
As such, the returned distance to a center b^ is a 2-approximation to d(q', J B / ); that is, 

d e (q',#) < ||q'- b;|| ffl < V^Hq'- b;|| < 2v / 2d(q',S') < 2v / 2d e (q / , £') . 

ByLemmaEH d e (q',fi') /5 < d fc (P,q) < d ffi (q / ,i3'). Namely, ||q' - b;|| ffi /(10V2) < d fc (P,q) < 
q' — b'J , implying the claim. ■ 

II J I I 

Remark 3.3. The algorithm of Theorem 13.21 works for any metric space. Given a set P 
of n points in a metric space, one can compute n/k points in a the product space induced 
by adding an extra coordinate, such that approximating the distance to the kth nearest 
neighbor, is equivalent to answering ANN queries on the reduced point set. 

4 Approximate Voronoi Diagram for dfc(q, P) 

Here, we are given a set P of n points in JR d , and our purpose is to build an AVD that 
approximates the fc-ANN distance, while using (roughly) 0{n/k) space. 
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4.1 Construction 



4.1.1 Preprocessing 

(A) Compute a quorum clustering for P using Lemma 12.41 Let the list of balls returned be 
bj = (q, I-;), for i = 1, . . .,n/k. 

(B) Compute an exponential grid around each quorum cluster. Specifically, let 



be the set of grid cells covering the quorum clusters and their immediate environ, where 
Ci is a sufficiently large constant. 
(C) Intuitively, X takes care of region of space immediately next to the quorum clusters. For 
the other regions of space, we can apply (intuitively) a construction of an approximate 
Voronoi diagram for the centers of the clusters (the details are somewhat more involved). 
To this end, let lift the quorum clusters into points in R d+1 , as follows 



where b^ = (q, r$) G H + , for i = 1, . . . , n/k. Note, that all points in B' belong to 
\J> = [0, l] d+1 . We now build (1 + £/8)-AVD for B' using the algorithm of |AM02j . The 
AVD construction provides a list of canonical cubes covering [0, l] d+l such that locating 
the smallest cube containing the query point, has an associated point of B' that is 
(1 + e:/8)-ANN to the query point. (Note, that there cubes are not necessarily disjoint. 
In particular, the smallest cube containing the query point q is the one determines what 
is the ANN of q.) 

We clip this collection of cubes to the hyperplane x^+i = (i.e., we throw away 
cubes that do not have a face on this hyperplane). For a cube □ in this collection, we 
denote by nn'(D) the point of B' assigned to it. Let S be this resulting set of canonical 
cubes. 

(D) Let W be the space decomposition resulting from overlaying the two collection of cubes 
X and S. Formally, we compute a compressed quadtree T that has all the canonical 
cubes of X and S as nodes, and W is the resulting decomposition of space into cells. 
One can overlay two compressed quadtrees representing the two sets in linear time 
|dBHTTTUl IHarllj . Here a cell associated with a leaf is a canonical cube, and a cell 
associated with a compressed node is the set difference of two canonical cubes. Each 
node in this compressed quadtree contains two pointers - one to the smallest cube of 
X, and one to the smallest cube of S that contains it. This information can be easily 
computed by doing a BFS on the tree. 

For each cell □ G W we store the following. 




(1) 



B'={b[,..., 




(I) An arbitrary representative point \3 rep G □. 

(II) The point nn'(D) G B' that is associated with the smallest cell of S that 
contains this cell. 



(Ill) A number (3 k (D rep ) that satisfies d fc (D rep , P) < f3 k (D rep ) < (l+e/4)d fc (D rep , P). 
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4.1.2 Answering a query 



Given a query point q data-structure computes the leaf cell (equivalently the smallest cell) 
from W that contains q by performing a point-location query in T. Let □ be the leaf cell 
that contains q. The data-structure returns 



as the approximate value to dfc(q, P). 

One can also compute a representative point that corresponds to the returned kth nearest 
neighbor distance. To this end, together with nn'(D) we associate an arbitrary point P that 
is associated with this quorum cluster. Similarly, with /3fc(D rep ) one stores the fcth nearest 
neighbor (or approximate k nearest neighbor) to D rep . One returns the point corresponding 
to the distance selected as the desired approximate kth nearest neighbor. 

4.2 Correctness 

Lemma 4.1. Let □ G W and q G □. Then the number computed by the algorithm is an 
upper bound on dfc(q, P). 

Proof: By Observation EZQ d fc (q, P) < d fc (D rep , P) + ||q-n rep || < f3 k (D rep ) + ||q-D rep ||. 
Now, let nn'(D) = (c, r). We also have, by Lemma IXTl that d fc (q, P) < ||q — c|| + r = 
|| q' — nn'(n)|| ffi . As the returned value is the minimum of these two numbers, the claim 
holds. ■ 

Lemma 4.2. Consider any query point q G [0, l] d , and let □ be the smallest cell ofW that 
contains the query point. Then, d(q',B') < ||q' — nn'(D)|| < (1 + e/8)d(q', B'). 

Proof: Observe, that space decomposition generated by W is a refinement of the decom- 
position of space (which is an AVD of £>') generated by the Arya and Malamatos [AM02] 
construction when applied to B' and restricted to the d dimensional subspace we are inter- 
ested in (i.e., Xd+i = 0). As such, nn'(D) is exactly the point returned by the AVD for this 
query point before the refinement, thus implying the claim. ■ 

4.2.1 The query point is close to a quorum cluster of the right size 

Lemma 4.3. Consider a query point q, and let □ C ]R d be any subset with q G □ such that 
diam(D) < £dfc(q, P). Then, for any u G we have 



min( ||q' - nn'(D)|| ffi , + ||q- □ 



rep 



) 



(2) 



(1 - e)d fc (q, P) < d k (u, P) < (1 + e)d fc (cb P) . 



Proof: By Observation 12.11 we have 



d fe (q, P) < d fc (u, P) + ||u - q|| < d fc (u, P) + diam(D) < d*(u, P) + ed k (q, P) . 



The other direction follows by a symmetric argument. 



11 



Lemma 4.4. If the smallest □ £ W i/iai contains q /ias diameter diam(D) < edfc(q, P) /4 
i/ien i/ie algorithm returns a distance which is between d k {q, P) and (1 + e)d k {q, P). 

Proof: Let CUp be the representative stored with the cell. Let a be the number returned by 
the algorithm. By Lemma [4.11 we have that dfc(q, P) < a. Since the algorithm returns the 
minimum of two numbers one of which is (3k(O rep ) + ||q — DrepH we have by Lemma [4.31 

a </3 jt (n rep ) + ||q-n rep || < (l + £/4)d fe (n rep) P) + ||q-n r e P || 

< (1 + e/4)(d fc (q, P) + ||q - DrepH) + sd k (q, P) /4 

< (1 + e/4)(d fc (q, P) + sd k (q, P) /4) + ed k (q, P) /4 

= (1 + e/4) 2 d,(q, P) + ed k (q, P) /4 < (1 + e)d fc (q, P) , 

establishing the claim. ■ 

Definition 4.5. Consider a query point q £ IR d . The first quorum cluster bj = ball(cj, r*) 
that intersects ball (q, (q, P)) is the anchor cluster of q. The corresponding anchor point 
is (ci, ri ) £ B. d+1 . 

Lemma 4.6. For any query point q, we have that 
(i) the anchor point (c, r) is well defined, 
(u) r<2d fe (q,P) ; 

(Hi) for b = ball(c, r) we have b D ball(q, d fc (q, P)) ^ 0, and 
(iv) ||q- c|| < 3d fc (q,P). 

Proof: Consider the k closest points to q in P. As P C b x U • • • U b n / k it must be that 
bal I (q, dfc(q, P)) intersects some bj. Consider the first cluster ball(c, r) in the quorum clus- 
tering that intersects ba II (q, d^(q, P)). Then (c, r) is by definition the anchor point and we 
immediately have ball(c, r) D ball(q, dfc(q, P)) ^ 0. 

As for (ii), observe that when this quorum cluster was computed, ball(q, dfe(q, P)) provided 
a ball that contains k points. Since the algorithm computes a 2-approximation to the smallest 
ball with this property, it follows that r < 2d k {q, P). 

Finally, as for (iv), we have r < 2dfc(q, P) and the ball around q of radius dfc(q, P) intersects 
the ball ball(c, r) thus implying that ||q — c|| < dfe(q, P) + r < 3d fe (q, P). ■ 

Lemma 4.7. Consider a query point q. If there is a cluster ball (c, r) in the quorum clustering 
computed, such that ||q— c|| < Qd k (q, P) and £d^(q, P) /4 < r < 6d k (q, P) then the output of 
the algorithm is correct. 



Proof: We have 



32r^ 32(£d fc (q,P)/4) 

— > > od fc (q, P) > ||q ■ 



Thus, by construction, the expanded environ of the quorum cluster ball(c, r) contains the 
query point, see Eq. (JTJ). As such the smallest quadtree cell □ that contains q has sidelength 
at most 

^•max(r,2||q-c||) < . max(6d fe (q, P) , 12d fc (q, P) ) < ^d fe (q, P) , 

by Eq. (CQ) and if £i > 48. As such, diam(D) < ed k (q, P) /4, and the claim follows by 
Lemma 14.31 ■ 
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4.2.2 The general case 

Lemma 4.8. The data- structure constructed above returns (l+e)- approximation to dfc(q, P), 
for any query point q. 

Proof: Consider the query point q and its anchor point (c, r). By Lemma 14. 6[ we have 
r < 2d fc (q, P) and ||q — c|| < 3dfc(q, P). This implies that 

d(q',#) < ||q'-(c,r)|| < ||q - c|| + r < 5d,(q, P) . (3) 

Let the returned point, which is a (1 + e/8)-ANN for q' in £>', be (c q , r q ) = nn'(D), where 
q' = (q,0). We have that ||q' - (c q , r q )|| < (1 + e/8)d(q', B') < 6d fc (q, P). In particular, 
||q - c q || < 6d fc (q, P) and r q < 6d fc (q, P). 

Thus, if r q > edfe(q, P) /4 or r > £dfc(q, P) /4 then we are done, by Lemma |4~T1 Otherwise, 
We have 

||q'-(c q ,r q )|| < (l + e/8) ||q'-(c, r)||, 
as (c q , r q ) is a (1 + e/8) approximation to d(q', B'). As such, 



< Hq 7 — (c,r)|| < ||q — c|| + r. (4) 



l + e/8 

As ball(c, r) D ball(q, d&(q, P)) ^ we have, by the triangle inequality, that 

||q-c|| -r< d fc (q,P). (5) 
By Eq. (j4j) and Eq. (jSj) we have 



|q -(Cq,r q J 



2r< ||q-c|| - r < d fc (q, P) . 



l + e/8 

By the above and as max(r, r q ) < ed fc (q, P) /4, we have 

||q - c q || + r q < ||q' -(c q , r q )|| + r q < (1 + e/8)(d fc (q, P) + 2r) + r q 

< (1 + e/8)(d fe (q, P) + £d fc (q, P) /2) + ed fe (q, P) /4 < (1 + e)d fc (q, P) . 

Since the algorithm returns for q a value that is at most ||q — c q || + r q the result is correct. ■ 
4.3 The result 

Theorem 4.9. Given a set P of n points in ]R d , a number k > 1 such that k\n, and < 
e sufficiently small, one can preprocess P, in O ( nlogn + — C e logn + yC' E ) time, where 

C e = O^^^loge -1 ) and C' £ = 0(e~ 2d+1 loge^ 1 ) . The space used by the data- structure is 
0{C e n/k). This data structure answers (1 + e)- approximate k nearest neighbor query in 

O^log— J time. The data- structure also returns a point of P that is approximately the 

desired k approximate nearest neighbor. 
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Proof: Computing the quorum clustering takes time O(nlogn) by Lemma [2.41 It is easy to 
see that \X\ = 0(^log|). From the construction in [AM02] we have |«S| = O(^logi) 
(note, that since we clip the construction to a hyperplane, we get l/e d in the bound and not 
l/e d+l ). A careful implementation of this stage takes time 0(nlogn + \W\ (logn + ^h=t))- 
Overlaying the two compressed quadtrees representing them takes linear time in their size, 
that is 0(|;f| + |<S|). 

The most expensive remaining step is to perform the k approximate nearest neighbor 
query for each cell in the resulting decomposition of W, see Eq. (j2J) (i.e., computing /3 fc (D rep ) 
for each cell □ G W). Using the data-structure of Section [6] (see Theorem 16.3 j) each query 
takes 0(\ogn + l/e^" 1 ) time (alternatively, we could use the data-structure of Arya et al. 
|AMM05j ). As such, this takes 

O \n log n + | W| ^log n + -^z^j J = O log n + log ^ log n + ^ d _ x l°g 

time, and this bounds the overall construction time. 

The query time is a point location query and is easily seen to take time 0(log(^)). 

Finally, one needs to argue that the returned point of P is indeed the desired approximate 
k nearest neighbor. It follows by going through our correctness proof and applying it to the 
returned point that the distance to it is indeed a 1 + 0(e) approximation to the k nearest 
neighbor distance. We omit the tedious but straightforward details of doing so. ■ 



4.3.1 Using a single point for each AVD cell 

The AVD generated can be viewed as storing two points in each cell □ of the AVD. These 
two points are in ]R d+1 , and for a cell □, they are 

(i) the point nn'(D) € B', and 

(ii) the point (D rep , (3 k (D rep )). 

The algorithm for d&(q, P) can be viewed as computing the nearest neighbor of (q, 0) to one of 
the above two points using the || ■ norm to define the distance. Furthermore, we can use the 
regular ||-|| to resolve which one of the points to use. Using standard AVD algorithms we can 
subdivide each such cell □ into 0{l/e d+1 loge -1 ) cells to answer this query approximately. 
By using this finer subdivision we can have a single point inside each cell for which the 
closest distance is the approximation to dfc(q, P). This incurs an increase by a factor of 
0(l/e d+1 loge -1 ) in the number of cells. 



5 Density estimation 

Given a point set P C ]R d , and a query point q G IR d consider the point v(q) = (di(q, P) , . . . , d n (q, 
This is a point in IR n and several problems in computational geometry can be viewed as com- 
puting some interesting function of v(q). For example one could view the nearest neighbor 
distance as the function that projects along the first dimension. Another motivating example 
is a geometric version of discrete density measures from |GMMll] . In their problem one is 
interested in computing <7fc(q) = J^ =1 dj(q, P). In this section, we show that a broad class 
of functions (that include can be approximated to within (1 ± e) by a data structure 
requiring space 0{n/k). 
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5.1 Basic tools 



Definition 5.1. A monotonia increasing function f : IR + — > H is said to be slowly grow- 
ing if there is a constant c > such that for e sufficiently small we have (1 — e)f(x) < 
f((l — e/c)x) < f((l + e/c)x) < (1 + e)f(x), for all x G IR + . The constant c is the growth 
constant of f. The family of slowly growing functions is denoted by J 7 ^. 

It is easy to see that the class F sg includes polynomial functions, but it does not include, 
for example, the function e x . We assume that given a x we can evaluate the function f(x) in 
constant time. In this section, we show how we can use the AVD construction to approximate 
any function J~k,f{') that can be expressed as 

k 

J 7 *,/(q) = E/(d*(<L p ))' 

i=l 

where / G jF sg . As f(x) = x 2 is slowly growing we have the function above = J^kj- 

Lemma 5.2. Let /^(q) = ^f=rjfc E /8l /(dt(q> P))- Then, for any query point q, we have that 
f?j(q)<T k , f (q)<(l + e/4)f* f ( q ). 

Proof: The first inequality is obvious. As for the second inequality, observe that d,(q, P) is 
a monotonically increasing function of i, and so is /(dj(q, P)). We are dropping the smallest 
k(e/8) terms of the summation J-fc,/(q) that is made out of k terms. As such, the claim 
follows. ■ 

The next lemma exploits a coreset idea, so that we have to evaluate only few terms of 
the summation. 

Lemma 5.3. There is a set of indices I C | [fce/8] , . . . , k}, and integer weights Wi > 0, 
for % G T, such that: 

(A) \T\ = O(^). 

(B) For any query point q, we have that J-"^(q) = w if(di(l, P)) is a good estimate 
for /« 7 (q); that is, (1 - e/4) ^(q) < /^(q) < (1 + e/4)T~ f (q). 

Furthermore, the setZ can be computed in 0(\I\) time. 

Proof: Given a query point q consider the function g q : {1, 2, . . . , n} — > 1R + defined as g q (i) = 
/(dj(q, P)). Clearly, since / G J-" S g, it follows that g q is a monotonic increasing function. The 
existence of X follows from Lemma 3.2 of |Har06] . when (l±e/4)-approximating the function 
fk,M) = Et rte /8l /(di(q, P))j that is, (1 - e/4)J^(q) < /^(q) < (1 + e/^/q). ■ 

5.1.1 Performing point-location in several quadtrees simultaneously 

Lemma 5.4. Consider a rooted tree T with m nodes, where the nodes are colored by I 
colors (i.e., a node might have several colors). Overall, assume there are 0(m) pairs of such 
(node, color) associations. One can preprocess the tree in 0(m) time and space, such that 
given a query leaf v of T , one can report the nodes Vi, . . . ,Vj in 0(1) time. Here, is the 
lowest node in the tree along the path from the root to v that is colored with color i. 
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Proof: Let us start with the naive solution - perform a DFS on T, and keep an array X of / 
entries storing the latest node of each color encountered so far along the path from the root 
to the current node. Storing a snapshot of this array X at each node would require 0(ml) 
space. But then one can answer a query in 0(1) time. As such, the challenge is to reduce 
the required space. 

To this end, interpret the DFS to be a Eulerian traversal of the tree. The traversal has 
length 2m — 2, and every edge traveled contains updates to the array X. Indeed, if the DFS 
traverses down from a node u to a child node w, the updates would be updating all the colors 
that are stored in w, to indicate that w is the lowest node for these colors. Similarly, if the 
DFS goes up from w to u, we restore all the colors stored in w to their value just before the 
DFS visited w. Now, the DFS traversal of T becomes a list of 0(m) updates. For each leaf 
we know its location in this list of updates, and we are interested in the last update before 
it in the list for each one of the / colors. 

So, let L be this list of updates. At each kth update, for k = tl for some integer t, store a 
snapshot of the array of colors as updated if we scan the list from the beginning till this point. 
Clearly, all these snapshots can be computed in 0(m) time, and require 0((m/I)I) = 0(m) 
space. 

Now, given a query leaf v, we go to its location in the list L, and jump back to the last 
snapshot stored. We copy this snapshot, and then we scan the list from the snapshot till 
v. This would require updating the array of colors at most 0(1) times, and can be done in 
0(1) time overall. ■ 

Lemma 5.5. Given I compressed quadtrees Vi, . . . , Dj of total size m in lR d , one can pre- 
process them in 0(m\ogI) time, using 0(m) space, such that given a query point q, one 
can perform a point-location queries in all I quadtrees, simultaneously for q, in 0(logm + 1) 
time. 

Proof: Overlay all these compressed quadtrees together. Since we are overlaying together / 
quadtrees, and this is equivalent to merging / sorted lists [Harllj . this takes O(mlogf) time. 
Let V denote the resulting compressed quadtree. Note, that any node of T>i, for i — 1, . . . , J, 
must be a node in T>. 

Given a query point q, we need to extract the I nodes in the original quadtrees Vi, for 
i = 1, . . . ,1, that contain the query point (these nodes can be compressed nodes). So, let □ 
be the leaf node of T> containing the query point q. Consider the path 7r from the root to 
the node of □. We are interested in the lowest node of ir that belongs to T>i, for % — 1, . . . , /. 
To this end, color all the nodes of T>i that appear in T> by color i, for % — 1, . . . , I. Now, we 
build the data-structure of Lemma [5.41 for T>. We can use this data-structure to answer the 
desired query in 0(1) time. ■ 

5.2 The data- structure 

We are given P C ]R d set of n points, a function / G J-" S g, an integer 1 < k < n, and e > 
sufficiently small. Here we describe how to build a data-structure to approximate J~kj(')- 
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5.2.1 Construction 



In the following, let a = 0(c) be a sufficiently large constant, where c is the growth constant 
of / (see Defnition l5.ll) . Consider the coreset X from Lemma [5731 For each i £ X we compute, 
using Theorem 14. 9[ a data structure (i.e., a compressed quadtree) T>i for answering (1+e/a)- 
approximate zth nearest neighbor distance queries for P. We then overlay all these quadtrees 
into a single quadtree, using Lemma 15.51 

Answering a Query. Given a query point q, we perform simultaneous point-location 
query in T>i, . . . , T>j, by using X>, as described in Lemma 15 .51 This results in a (1 + e/Ac) 
approximation to dj(q, P), for i £ X, and takes 0(logm + I) time, where m is the size of 
V, and / = |X|. We return z = J2iex w if(. z i) where Wi are the weights associated with the 
members of the coreset from Lemma 15.31 

Bounding the quality of approximation. We only prove the upper bound on z. The 
proof for the lower bound is similar. As the z% are (1 ± e/4c) approximations to dj(q, P) we 
have, (1 — e/4c)zi < d«(q, P), for % £ X, and it follows from definitions that, 

(1 - e/Qwiffy < wj((l - e/Ac)z^ < ^/(d^q, P)) , 

for i £ X. Therefore, 

(1 - e/4)z = (1 - e/4) < P)) = ^(q). (6) 

Using Eq. (jBJ) and Lemma 15731 it follows that, 

(1 - e/Afz < (1 - e/4)^(q) < /^(q). (7) 

Finally, by Eq. Q and Lemma [5.21 we have, 

(l-E/4) 2 z<f~ f ( q )<T k j(q). 

Therefore we have, (1 — s)z < (1 — e/4) 2 2; < Xfc 5 /(q), as desired (i.e., this is equivalent to 
z< J- fei/ (q)/(l-£)). 

Preprocessing space and time analysis. We have that I = \X\ = 0(e~ 1 \ogk). Let 
Cx = 0(x _d logx _1 ). By Theorem 14.91 the total size of all the PjS (and thus the size of the 
resulting data-structure) is 

»=E°W)=°(<^). (-») 

Indeed, the maximum of the terms involving n/i is 0(n/ke) and / = 0(e _1 log fc). By 
Theorem 14.91 the total time taken to construct all the T>i is 

^ / n ra , \ /nlognlogfc nlognlog/c nlogA; , \ 

2^ O (ra log n + j C £/Q log n + -O e/a J = O I + ^ £ / Q + -^-C e/a 1 , 

where C^. = 0(x~ 2d+1 logx -1 ) . The time to construct the final quadtree is 0(5 log/), but 
this is subsumed by the construction time above. 
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5.2.2 The result 



Summarizing the above, we get the following result. 

Theorem 5.6. For P be a set of n points in IR d . Given any f G J-~ S g, an integer 1 < k < n 
and e > 0, one can build a data- structure to approximate J-fcj(-). Specifically, we have: 

(A) The construction time is 0{C\n logn log k), where C\ = O^^ 2 ^" 1 loge -1 ) . 

(B) The space used is O ^C*2 ^ log k j , where C2 = 0(e~ d ~ 2 loge" 1 ) . 

(C) For any query point q, the data- structure computes a number z, such that (1 — e)z < 
^,/(q) < {l + e)z. 



(D) The query time is 01 logn + 




(The O notation here hides constants that depends on f .) 



6 ANN queries where k and e are part of the query 

We present here a data-structure with query time 0(logn + l/e^ 1 ) that does not require 
knowing either k or e during the preprocessing stage, and both are specified during query 
time (together with the query point). Unlike our main result, this data-structure requires 
linear space. Previous data-structures required knowing e in advance |AMM05j . 



6.1 Rough approximation 

Observe that a fast constant approximation to dfc(q, P) is implied by Theorem 13.21 if k is 
known in advance. We next describe a polynomial approximation when k is not available 
during preprocessing. We sketch the main ideas in the following, our argument closely follows 
the exposition in |Harllj . 

Lemma 6.1. Given a set P of n points in lR d , one can preprocess it, in O(nlogn) time, 
such that given any query point q and 1 < k < n, one can find a number R satisfying, 
dfc(q, P) < R < n c dfc(q, P) in O(logn) time. The result is correct with high probability i.e. at 
least 1 — l/n c ~ 2 for any constant c > 4. 

Proof: By a scaling and translation one may assume that P C [l/2,3/4] d . Next, consider a 
compressed quadtree decomposition T of b + [0, l] d for P, whose shift b is a random vector 
in [0, l/2] d . For each node v of T compute the number of points of P contained in its subtree, 
and the axis parallel bounding box B v of the point set stored in this subtree. This can be 
computed by a bottom-up traversal. 

Given a query point q G [l/2,3/4] d , one locate the lowest node v of T whose region 
contains q (this takes O(logn) time, see |Harllj ). By performing a binary search on the root 
to v path one can locate the lowest node whose subtree contains k or more points from P. 
Let rg^ denote the region associated with the node. The algorithm returns R, the distance 
of the query point to the furthest point of B Vk (i.e., the bounding box of all the points stored 
in this subtree) as the approximate distance. 
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To see the quality of approximation consider the smallest ball b centered at q with radius 
r = dfc(q, P) that contains k points. Next, consider the smallest canonical grid having 
side length a > n c ~ 1 r (thus, a < 2n c ~ 1 r). Randomly translating this grid, we have with 
probability > 1 — rd/a > 1 — l/n c ~ 2 that this ball is contained inside a canonical cell □ of 
this grid. This in turn implies that the diameter of B Vk is bounded by y/da, Indeed, if the 
cell of z/j; is contained in □ then this clearly holds. Otherwise, if □ is contained in the cell 
Vki then v\. must be a compressed node, the inner portion of its cell is contained in □, and 
the outer portion of the cell can not contain any point of P. As such, the claim holds. 

As such, for the returned distance R, we have that 

r = d fc (q, P) < R < diam^J + r < \fd~a + r < \fd2n c ~ l r + r < n c r. 

■ 

An alternative to the argument used in Lemma I6.1[ is to use two shifted quadtrees, 
and return the smaller distance returned by the two trees. It is not hard to argue that in 
expectation the returned distance is 0(l)-approximation to the desired distance (which then 
implies the desired result via Markov's inequality). One can also derandomize the shifted 
quadtrees and use d + 1 quadtrees in instead. See jHarll] . 

We next show how to refine this approximation. 

Lemma 6.2. Given a set P of n points in IR ', one can preprocess it, in 0{n\ogn) time, so 
that given a query point q and an estimate R satisfying d&(q, P) < R < n°' 1 ^djfc(q, P), then 
one can output a number (3 satisfying, dfc(q, P) < /3 < (1 + e)dfc(q, P), in 0(\ogn + l/e^" 1 ) 
time. Furthermore, one can return a point p G P such that (1 — e)dfc(q, P) < ||q— p|| < 
(l + e)d*(q,P). 

Proof: We assume that PU {q} C [1/2, 1/2 + l/n] d . The algorithm of Lemma [67T1 return the 
distance between q and some point of P and as such the returned approximation R satisfies 
dfc(q, P) < R < n°Wdfc(q, P) < diam(P U {q}) < d/n. We start with a compressed quadtree 
for P having U = [0, l] d as the root. We look at the set of canonical cells Xq, with side length 
at least R that intersect the ball ball(q, R). It is guaranteed that the kth nearest neighbor 
of q lies in this set of cubes. The set X can be computed in O(|X | logn) time using cell 
queries [Harllj . 

For each node v in the compressed quadtree there is a level associated with it. This 
is lvl(v) = log 2 sidelength(D„). The root has level and it decreases as we go down the 
compressed quadtree. Intuitively, —C{y) is the depth of the node if it was a node in a regular 
quadtree. 

We maintain a queue of such canonical grid cells. Each step in the search consists of 
replacing cells in the current level with their children in the quadtree, and then deciding if 
we want to descend a level. In the ith iteration, we replace every node of Xi_i by its children 
in the next level, and put them into the set X^ 

We then update our estimate of d^(q, P). Initially, we set I = [l , h ] = [0, R]. For every 
node t> G Xj, we compute the closest and furthest point of its cube (that is the cell of this 
node) from the query point (this can be done in 0(1) time), this corresponds to an interval 
of distances I v . Let n v denote the number of points stored in the subtree of v. For a real 
number x, let L(x), M(x), R(x) denote the total number of points in these intervals that are 
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intervals that are to the left of x, contains x, and are to the right of x, respectively. Using 
median selection, one can compute in linear time (in number of nodes of Xj) the minimum 
x such that L(x) > k iy let this value be hj. Similarly, one can, in linear time, compute the 
minimum x such that L(x)+M(x) > ki, and let this value be lj. Clearly, the desired distance 
is in the interval It = [U, hj]. 

The algorithm now iterates over v G X^ If I v is strictly to the left of lj, then the 
algorithm throws this node away (it is too close to the query and can not contain the kth 
nearest neighbor), setting k <— k — n v . Similarly, if /„ is to the right of h it can be thrown 
away. One this done, the algorithm moves to the next iteration. 

The algorithm stops as soon as the diameter of all the cells of Xi is smaller than (er/8)U. 
The algorithm then picks a representative point from each node of Xi (each node of the 
quadtree has an arbitrary representative point computed for it out of the subset of points 
stored in its subtree), and returns (say) the furthest such point as the desired approximate 
nearest neighbor. To see what the returned answer is indeed correct, observe that lj < 
dfc(q, P) < hj and hj — lj < (er/8) U, which implies the claim. The returned representative point 
distance from q is in the interval [a,/3], where a — U — (e/8)lj and (3 = hj < lj + (e/8)\i < 
(1 + e/2)(l — e/8)lj < (1 + e/2)a. This interval also contains dfc(q, P). As such, /3 is indeed 
the required approximation. 

Since we are working with compressed quadtrees, a child node might be many levels 
below the level of its parent. In particular, if a node's level is below the current level, we 
freeze it and just move it on the set of the next level. We replace it by its children only when 
its level had been reached. 

As for the running time, clearly the algorithm takes time 0(|Xo| log n + |Aj|). In 
particular, let Aj be the diameter of the cells in the level being handled in the ith iteration. 
Clearly, we have that hj < lj + Aj. All the cells of Aj that survive must intersect the 
ring with radiuses [lj, hj] around q. As such, by a simple packing argument, |Xj| < rii = 
0((lj/Aj + l)^ 1 ). As long as A, > d fc (q, P), we have that rij = 0(1), as lj < d fc (q, P). 
This clearly holds for the first O(logn) iterations. It is now easy to verify that once this 
no longer holds, the algorithm performs at most [log 2 (l/e)] +0(1) additional iterations, as 
then Aj < (e:/16)dfc(q, P) and the algorithm stops. It is now easy to verify that the rijS in 
this range can grow exponentially, with the last one being 0(l/e d ~ 1 ). Thus implying that 
JA \X-i\ = 0(\ogn + l/e d ~ l ) ) as desired. ■ 

6.2 The result 

Theorem 6.3. Given a set of n points P in !R d , one can preprocess them, in O(nlogn) 
time, into a data structure of size 0(n), such that given a query point q, an integer 1 < 
k < n and a e > one can compute, in 0(logri + l/e d ~ l ^j time, a number (3 such that 
dfc(q, P) < /? < (1 + e)dfc(q, P). The data- structure also returns a point p G P such that 
(l-£)d fc (q,P)<||q-p||<(l+£)d fc (q,P). 

6.3 A generalization — Weighted version of k ANN 

We consider a generalization of the k ANN problem where as usual we are given a set of 
points P C ]R d , weights w p > for each p G P and a number e > 0. A query consists of a 
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point q and a number x > and we are required to output a distance d, where d < (1 +e)D. 
Where D is the smallest distance such that such that 



The usual k ANN problem is just the specialization with w p = 1 for all p and x = k. We 
remark that the algorithm for k ANN presented in this section, works with minor changes 
for this generalization as well, even when e is supplied with the query point. Furthermore, 
the run time and space complexity remain unchanged and do not depend on the weights w p . 

7 Conclusions 

In this paper, we presented a data-structure for answering approximate k nearest neighbor 
queries m H d (here d is a constant). Our data-structure has the surprising property that 
the space required is 0(n/k). It is easy to verify that up to noise this is the best one can do 
for this problem. This data-structure also suggests a natural way of compressing geometric 
data, such that the resulting sketch can be used to answer meaningful proximity queries 
on the original data. We then used this data-structure to answer various proximity queries 
using roughly the same space and query time. 

We also presented a data-structure for answering such queries where both k and e are 
specified during query time. This data-structure is simple and practical. 

There are many interesting questions for further research. 

(A) In the vein of the authors recent work jHKlla] , it is to easy to verify that our 
results extends in a natural way to metrics of low doubling dimensions ( |HKllaj 
describes what an approximate Voronoi diagram is for doubling metric). It also 
seems believable that the result would extend to the problem where the data is 
high dimensional but the queries arrive from a low dimensional manifold. 

(B) It is natural to ask what one can do for this problem in high dimensional Euclidean 
space. In particular, can one get query time close to the one required for approx- 
imate nearest neighbor |IM98j . Of particular interest is getting a query time that 
is sublinear in k and n while having subquadratic space and preprocessing time. 

(C) The dependency on e in our data-structures is probably not optimal. One can 
probably get space/time tradeoffs, as done by Arya et al. |AMM09] . 
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